ABSTRACT

With the randomization approach, sensitive data items of 
records are randomized to protect privacy of individuals while 
allowing the distribution information to be reconstructed for 
data analysis. In this paper, we distinguish between recon- 
struction that has potential privacy risk, called micro recon- 
struction, and reconstruction that does not, called aggregate 
reconstruction. We show that the former could disclose sensi- 
tive information about a target individual, whereas the latter 
is more useful for data analysis than for privacy breaches. To 
limit the privacy risk of micro reconstruction, we propose a 
privacy definition, called (ε, δ)-reconstruction-privacy. Intuitively,
this privacy notion requires that micro reconstruction
has a large error with a large probability. The promise of 
this approach is that micro reconstruction is more sensitive 
to the number of independent trials in the randomization 
process than aggregate reconstruction is; therefore, reducing
the number of independent trials helps achieve
(ε, δ)-reconstruction-privacy while preserving the accuracy of
aggregate reconstruction. We present an algorithm based on
this idea and evaluate the effectiveness of this approach us- 
ing real life data sets. 



1. INTRODUCTION 

Randomization is one of the promising approaches in privacy- 
preserving data mining. With this approach, sensitive data 
items in records are randomized to protect the privacy of in- 
dividuals while allowing the distribution information to be 
reconstructed with reasonable accuracy. An early use of
randomization is randomized response (RR) for collecting
responses on sensitive questions [19]. For example, to find the
percentage of employees stealing from the company, the em- 
ployer asks each employee the question "do you steal from the 
company?". To prevent linking the responder to his/her
sensitive response, each employee submits the true answer ("Yes"
or "No") with a certain retention probability p and otherwise
submits an answer chosen uniformly at random from {Yes, No},
so that each answer is chosen at random with probability
(1 − p)/2. This type of randomization, also called input
perturbation, is extended to categorical values in privacy
preserving data mining for mining association rules [8] [2] [9] [17].
Randomization is also studied in privacy preserving data
publishing, where a data publisher has collected the original data
D and wants to release a sanitized version D* for data mining.
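The randomized response scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 20% true rate, and the sample size are hypothetical and do not come from the paper. The estimator inverts E[observed] = p·f + (1 − p)/2, which follows from the scheme as described.

```python
import random

def randomized_response(truth: bool, p: float, rng: random.Random) -> bool:
    """Report the true answer with probability p, else a uniform random answer."""
    if rng.random() < p:
        return truth
    return rng.random() < 0.5

def estimate_yes_fraction(reports: list, p: float) -> float:
    """Invert E[observed] = p * f + (1 - p) / 2 to estimate the true fraction f."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) / 2) / p

rng = random.Random(0)                            # fixed seed for repeatability
true_answers = [i < 2000 for i in range(10000)]   # hypothetical: 20% answer "Yes"
reports = [randomized_response(a, 0.5, rng) for a in true_answers]
estimate = estimate_yes_fraction(reports, 0.5)
```

No individual report reveals the responder's true answer with certainty, yet the aggregate fraction can be recovered with reasonable accuracy from a large number of reports.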

In this paper, we consider the data publishing scenario in 
which the data set D contains both non-sensitive attributes 
(e.g., age, gender, etc.) and a sensitive attribute (e.g., dis- 
ease), as in most realistic settings. We assume that an ad- 
versary has named a target individual, t, whose record is con- 
tained in D, and has figured out somehow the non-sensitive 
attributes of t. The adversary's goal is to infer the sensitive 
attribute of t. To preserve the privacy of individuals, the sen- 
sitive attribute value in each record is randomized following 
a certain retention probability p, while allowing reconstruc- 
tion of distribution information such as the count of records 
in D satisfying a given predicate ψ. We show that, with the
help of non-sensitive attributes, the adversary could recon- 
struct the distribution of the sensitive attribute for a target 
individual, even if major privacy definitions are satisfied. If 
this distribution is skewed, the target individual's privacy is 
breached. This attack is termed "reconstruction attack". 

1.1 Reconstruction Attacks 

One major privacy definition limits the change in the
adversary's confidence in the sensitive value x of a given record
as a result of interacting with or exposure to the database.
For example, the ρ1-ρ2 privacy proposed in [8] states that
if the prior probability Pr[X = x] is not more than ρ1, the
posterior probability Pr[X = x | Y = y], given the published
data D*, should not be more than ρ2, where ρ1 < ρ2 and X
and Y are the variables for the original and perturbed
sensitive values in a record, respectively. In the literature [16] [3]
[2] [22] [4], Pr[X = x] is measured by the fraction of records
with X = x in the whole table D, and Pr[X = x | Y = y] is
measured by the fraction of records with X = x among the
records with Y = y in the whole table D*. Precisely,



$$\Pr[X = x \mid Y = y] = \frac{\Pr[X = x] \cdot p[x \to y]}{\sum_{x'} \Pr[X = x'] \cdot p[x' \to y]}$$

where p[x → y] is the probability that x is perturbed to
y, and can be determined by the retention probability p.
Note that these measurements do not take into account the 



non-sensitive attributes of records in D or the acquired non- 
sensitive information about the target individual t. The next 
example shows that with non-sensitive information, the
adversary could infer the sensitive information of t with a
probability higher than ρ2, even if ρ1-ρ2 privacy is ensured.

Example 1 (Attacks on ρ1-ρ2 privacy). Let D contain
10 × k records over the sensitive attribute Disease and
the non-sensitive attributes {Gender, Age}, where k is an
integer and Disease has the domain {x1, ..., x10}. Suppose
that k records in D have Gender = M and Age = 30, all of
which have the value x1 for Disease. Let g denote this set of
records. The values x2, ..., x10 are uniformly distributed among
the remaining 9 × k records in D. Note that, for 1 ≤ i ≤ 10,
Pr[X = xi] = 10%, and 0.1-0.5 privacy ensures
Pr[X = xi | Y = y] ≤ 50% for all xi. This level of privacy can be
achieved by retaining the original value in a record with
probability 50% and perturbing it to each different value (e.g.,
x1 to each of {x2, ..., x10}) with probability (1 − 0.5)/9.
Let g* denote the randomized version of g.

Suppose that an adversary wants to infer the disease of
the target individual t = Bob having the non-sensitive
information Gender = M and Age = 30. The adversary could
estimate the (relative) frequencies of x1, ..., x10 in g based
on g*, instead of D*, because all other records in D* do not
match t's non-sensitive information. Let (F'1, ..., F'10) be the
estimated frequencies in g. For a sufficiently large g (by using
a large k) and a reasonable estimator such as the maximum
likelihood estimator (MLE), F'1 will be sufficiently close to
the true frequency f1 [3], which is 100%. Consequently, the
adversary is able to infer that t has the disease x1 with a
probability larger than ρ2 = 0.5.
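The numbers in Example 1 can be checked with a small simulation. The sketch below is illustrative only: the group size of 5000, the seed, and all function names are assumptions, and the frequency estimator simply inverts E[observed] = f·0.5 + (1 − f)·(0.5/9) for the perturbation described in the example, rather than reproducing the estimator used later in the paper.

```python
import random

M = 10                          # domain size of Disease
RETAIN = 0.5                    # probability of keeping the original value
MOVE = (1 - RETAIN) / (M - 1)   # probability of moving to each other value

def posterior(prior: float) -> float:
    """Pr[X = x | Y = x] measured over the whole table, by Bayes' rule."""
    return prior * RETAIN / (prior * RETAIN + (1 - prior) * MOVE)

def perturb(value: int, rng: random.Random) -> int:
    """Keep the value with probability RETAIN, else move to another value."""
    if rng.random() < RETAIN:
        return value
    return rng.choice([v for v in range(M) if v != value])

def estimate_frequency(perturbed: list, value: int) -> float:
    """Invert E[observed] = f * RETAIN + (1 - f) * MOVE."""
    observed = perturbed.count(value) / len(perturbed)
    return (observed - MOVE) / (RETAIN - MOVE)

rng = random.Random(1)
g = [0] * 5000                  # hypothetical micro group: all records have x1
g_star = [perturb(v, rng) for v in g]
f1_estimate = estimate_frequency(g_star, 0)
```

Measured over the whole table, the posterior stays at the 50% limit, while the frequency of x1 reconstructed inside the micro group comes out close to 100%.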

A recent breakthrough in privacy definitions is differential
privacy [7]. The idea is to hide the presence or absence of a
participant in the database by making two neighboring data sets
(nearly) equally probable for giving the produced query
answer. Precisely, the λ-differential privacy mechanism K ensures
that, for any two data sets D and D' differing on at most one
record, for all queries Q, and for all query outputs o',

$$\Pr[K(D, Q) = o'] \le \exp(\lambda) \cdot \Pr[K(D', Q) = o']$$

With a small λ, exp(λ) is close to 1, so D and D' are almost
equally likely to be the underlying database that produces
the final output of the query. To ensure this property, the
λ-differential privacy mechanism adds the noise ξ to the true
answer o and publishes the noisy answer o' = o + ξ, where ξ
follows the Laplace distribution Lap(b) with density
(1/(2b)) exp(−|ξ|/b) and b = 1/λ. The next example shows that
such noisy answers can
be exploited to estimate the likelihood of the sensitive value
for a target individual.

Example 2 (Attacks on differential privacy). Consider
the D and t again in Example 1. An adversary could
infer the distribution of Disease for t by issuing two queries
Q1 and Q2: Q1 asks for the count of records that satisfy
"Gender = M ∧ Age = 30" and gets the noisy answer o'1 =
o1 + ξ1, and Q2 asks for the count of records that satisfy
"Gender = M ∧ Age = 30 ∧ Disease = x1" and gets the noisy
answer o'2 = o2 + ξ2, where oi are the true answers and ξi are
the noises added, i = 1, 2. Note that the relative error ξi/oi gets
smaller as the true answer oi gets larger, because ξi has
zero mean and the variance 2b², where b = 1/λ is a constant
for a given λ-differential privacy mechanism. Therefore, as
the answer o1 increases, o'2/o'1 approaches o2/o1, the fraction
of records having x1 among the records that share the gender
and age with t. This discloses the disease x1 of t because
o2/o1 = 100%.
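The effect described in Example 2 can be illustrated numerically. The sketch below is an assumption-laden illustration, not the mechanism of [7]: it draws Laplace noise as the difference of two exponential draws (a standard identity), uses λ = 0.1, and picks group sizes arbitrarily. All names are hypothetical.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_count(true_count: int, lam: float, rng: random.Random) -> float:
    """Publish o' = o + xi with xi ~ Lap(b), b = 1 / lam."""
    return true_count + laplace_noise(1.0 / lam, rng)

rng = random.Random(2)
lam = 0.1
ratios = {}
for group_size in (10, 100, 10000):
    # Every record in the group has x1, so the true answers satisfy o1 = o2.
    o1_noisy = noisy_count(group_size, lam, rng)
    o2_noisy = noisy_count(group_size, lam, rng)
    ratios[group_size] = o2_noisy / o1_noisy
```

Because the noise scale is fixed while the true counts grow, the ratio for the largest group lands very close to the true value of 1, whereas the ratio for the smallest group is dominated by noise.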

In these examples, randomized data or noisy query answers 
are used to reconstruct the distribution of sensitive informa- 
tion for a target individual, even though strong privacy defi- 
nitions are satisfied. If such reconstruction is accurate and if 
the true distribution is skewed, as in these examples, the re- 
constructed distribution discloses the sensitive information of 
the target individual with a high probability. This attack is 
powerful in that it works on different types of randomization
techniques and data sharing scenarios, i.e., the random value
replacement in Example 1 through either input perturbation
or data publishing, and the random noise addition to query
answers in Example 2, also known as output perturbation.

1.2 Contributions 

The contributions in this work are as follows. 

• For the first time, we consider the implication of non- 
sensitive attributes on sensitive reconstruction of data 
distribution from randomized data. We distinguish two 
types of reconstruction: The micro reconstruction seeks 
to reconstruct the distribution of the sensitive attribute 
in a set of records that fully match a target individ- 
ual on all non-sensitive attributes; the aggregate recon- 
struction aims to reconstruct the distribution in a set 
of records that only partially match a target individual. 
We argue that micro reconstruction is the only type of
reconstruction that poses a privacy risk.

• To address the privacy risk of micro reconstruction, we
propose a notion of (ε, δ)-reconstruction-privacy to ensure
a minimum value on the tail probabilities of micro
reconstruction error. We present a bound conversion
theorem that converts between a bound on tail probabilities
of a random variable and a bound on tail probabilities
of reconstruction error, which allows us to leverage
the Chernoff bound to develop a testable instantiation
of (ε, δ)-reconstruction-privacy. Since the bound conversion
theorem does not hinge on the particular form
of bounds, our approach can be instantiated to other
upper bounds and modified to constrain lower bounds
of tail probabilities.

• The promise of this approach is that micro reconstruction
is more sensitive to the number of independent
trials in the randomization process than aggregate
reconstruction, analogous to the fact that the first 10 coin
flips are more critical for the estimation of head probability
than the second 10 coin flips. We leverage this
difference to design an algorithm for achieving
(ε, δ)-reconstruction-privacy while preserving the utility of
aggregate reconstruction.

• Empirical evaluation on real life data sets presents two
important findings: Firstly, (ε, δ)-reconstruction-privacy
is violated even when major privacy definitions such
as ρ1-ρ2 privacy and differential privacy are satisfied.
Secondly, the additional information loss incurred for
achieving (ε, δ)-reconstruction-privacy is small.

The rest of the paper is organized as follows. Section 2
reviews related work. Section 3 defines the problem studied
in this work. Section 4 presents an efficient instantiation
of (ε, δ)-reconstruction-privacy. Section 5 presents the
algorithm to achieve (ε, δ)-reconstruction-privacy. Section 6
presents empirical findings. Finally, we conclude the paper.

2. RELATED WORK 

Two classes of randomization methods have been exten- 
sively studied in the literature: random perturbation and ran- 
domized response. Random perturbation is primarily used 
for quantitative data. For example, Agrawal and Srikant [1]
build accurate decision tree classification models on the
perturbed data, and Kargupta et al. [H] point out that arbitrary
randomization can reveal a significant amount of information
under certain conditions. Randomized response is primarily
used for categorical data. Its basic idea was proposed by
Warner [19], and based on this technique the problem of mining
association rules from disguised data was studied in [8] [2]
[17]. In this paper, the term "perturbation" or "randomization"
refers to the randomization for categorical data.

Techniques for probabilistic perturbation have also been 
investigated in the statistics literature. The PRAM method 
[10] considers the use of Markovian perturbation matrices.
The disclosure risk is measured by a notion of expectation 
ratios, defined as the ratio of the expected number of records 
in the perturbed file with the observed value equal to the 
value in the original file, and the expected number of records 
in the perturbed file with the observed value not equal to the 
value in original file. 

Formal definitions of privacy breaches were proposed in
[H] [6] following the same paradigm: for every record in the
database, the adversary's confidence in the values of the given
record should not significantly increase as a result of
interacting with or exposure to the database. Recent works based
on such definitions include [11] [16] [3] [2] [22] [4]. These
approaches either consider one attribute (i.e., the sensitive
attribute) [11], or assume that all attributes are sensitive [16] [3]
[2], or ignore the role of non-sensitive attributes in the
reconstruction of the sensitive attribute in the context of privacy
risk [22] [4]. Reconstruction of data distribution is traditionally
considered as utility. To our knowledge, our work is the
first to study such reconstruction as privacy breaches.

An alternative to the randomization approach is the
partition based approach in which the records are partitioned
to ensure some sort of balanced distribution of sensitive data
items in each partition [14] [21]. The randomization approach,
due to its non-deterministic nature, is more robust to
auxiliary information [13] [20] [18].

The differential privacy mechanism [7] hides the presence of 
a single record in the database by adding random noises to a 
query answer. As we will see in Section 6, such noise addition 
is not sufficient to prevent the adversary from reconstructing 
the distribution of sensitive data for a target individual. 

3. PROBLEM STATEMENT 

We assume that the data publisher has collected a table
D(NA, SA) on non-sensitive attributes NA = {A1, ..., Ad}
and one sensitive attribute SA. Each record in the table
corresponds to a participant or individual. For a record r in
D, r[NA] and r[SA] denote the values of r on NA and SA.
|·| denotes the cardinality of a set. The sensitive attribute
SA has a discrete domain {x1, ..., xm}. The count of xi
refers to the number of records having xi, and the frequency
of xi refers to the percentage of records having xi. As in [3]
[16] [2] [11], we assume that the SA value in a record is chosen
independently at random according to some fixed probability
distribution. The publisher allows the researcher to learn this
distribution, but wants to hide the SA value of an individual
record.

3.1 Perturbation 

We consider the data publishing scenario where the data
publisher wants to publish D for data analysis, but wants to
hide the SA value in a record. In the uniform perturbation [3]
[16] [2] [11], the SA value x in a record is processed by flipping
a coin with head probability 0 < p < 1, called the retention
probability. If the coin lands on heads, x is retained; otherwise,
x is replaced with a random value from the domain of SA,
where each value is selected with probability (1 − p)/m. This
perturbation process is parameterized by the m × m perturbation
matrix P:

$$P_{ji} = \begin{cases} p + \frac{1-p}{m} & \text{if } j = i \text{ (retain } x_i\text{)} \\ \frac{1-p}{m} & \text{if } j \ne i \text{ (perturb } x_i \text{ to } x_j\text{)} \end{cases} \qquad (1)$$

p + (1 − p)/m is the sum of the probability that xi is retained
and the probability that xi is replaced with the same xi.
Let D* contain all perturbed records. For any subset S of
D, S* denotes the same set of records as S in D*. Note that
|S*| = |S|. The choice of p dictates the trade-off between
the privacy concern of hiding the sensitive value in a record
and the utility for reconstructing the distribution of SA. The
work in [S] [1] determines the maximum retention probability
p for ensuring a given ρ1-ρ2 privacy [8] based on ρ1, ρ2, and
m.
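As a minimal sketch of the uniform perturbation just described, the following code builds the matrix of Equation (1) and perturbs a single SA value. SA values are represented by integer indices 0..m−1; the function names and the example parameters (m = 4, p = 0.6) are assumptions made for illustration.

```python
import random

def perturbation_matrix(m: int, p: float) -> list:
    """Matrix of Equation (1): entry [j][i] is Pr[x_i is published as x_j]."""
    off = (1 - p) / m
    return [[p + off if j == i else off for i in range(m)] for j in range(m)]

def perturb(value: int, m: int, p: float, rng: random.Random) -> int:
    """Retain value with probability p, else draw uniformly from the domain."""
    if rng.random() < p:
        return value
    return rng.randrange(m)     # may return the same value again

P = perturbation_matrix(4, 0.6)
column_sums = [sum(P[j][i] for j in range(4)) for i in range(4)]
```

Each column of P sums to 1, since every original value must be published as some value in the domain.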

The above perturbation process has some interesting prop- 
erties. First, it modifies only the SA attribute, not NA 
attributes. Therefore, data analysis involving only NA at- 
tributes incurs no information loss by accessing the random- 
ized data D* . This is an advantage compared to the dif- 
ferential privacy mechanism [7] where a query answer will 
be distorted even if it only involves non-sensitive attributes. 
Second, the perturbation of a record depends on the original 
SA attribute in the record, but not on any other records in 
D. Therefore, for any subset S of records from D, we can 
assume that S* is produced by the same perturbation matrix 
P. This record independence also implies that insertion and 
deletion of records on D can be done through insertion and 
deletion of randomized records on D* . 

We consider data analysis through answering count queries.
A count query has a predicate ψ of the form ∧(A = a), where
A is either SA or an attribute in NA, and a is a value from
the domain of A. The answer to the query is the count of
the records in D satisfying ψ. This answer must be estimated
using D*. If ψ contains no equality for SA, the answer on D*
is exactly the same as the answer on D. If ψ contains an equality
SA = xi, a reconstruction process will be applied to the
subset of records in D* that satisfy ψ', where ψ' is ψ with the
equality SA = xi removed. Let S* be this subset and let S
be the set of corresponding records in D. The reconstruction
seeks the most likely estimator of the distribution of SA in
S, denoted by F', given S* and the perturbation operator P.
The answer for the query is estimated by |S| · F'i, where F'i is
the component of F' for xi. The detailed reconstruction will
be discussed in Section 4.1.
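The query-answering procedure above might be sketched as follows. The record layout (dictionaries with an "SA" key), the function name, and the toy table are hypothetical. For the reconstruction step the sketch uses the closed form F' = (O*/|S| − (1 − p)/m)/p that the paper derives in Section 4.1.

```python
def reconstruct_count(d_star, na_predicate, sa_value, m, p):
    """Estimate the count of records in D satisfying the predicate and SA = sa_value."""
    # S*: records of D* matching the predicate with the SA equality removed.
    subset = [r for r in d_star
              if all(r[a] == v for a, v in na_predicate.items())]
    if not subset:
        return 0.0
    observed = sum(1 for r in subset if r["SA"] == sa_value)    # O*
    f_est = (observed / len(subset) - (1 - p) / m) / p          # F'
    return len(subset) * f_est                                  # |S| * F'

# Hypothetical perturbed table with one non-sensitive attribute.
d_star = [
    {"Job": "Teacher", "SA": "flu"},
    {"Job": "Teacher", "SA": "flu"},
    {"Job": "Teacher", "SA": "flu"},
    {"Job": "Teacher", "SA": "cold"},
    {"Job": "Nurse", "SA": "flu"},
]
answer = reconstruct_count(d_star, {"Job": "Teacher"}, "flu", m=2, p=0.5)
```

On a group this small the estimate is very noisy and can even fall outside [0, |S|], which is why accuracy depends on the size of S.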

3.2 Adversaries and Micro Reconstruction 

We assume that an adversary has named some target in- 
dividual, denoted t, whose record is contained in D, and has 
figured out t's values on all non-sensitive attributes NA. To 
infer the SA value of t, the adversary needs to reconstruct 

the frequencies F' of SA values from the randomized data 
D* . Given the knowledge about t's information on all non- 
sensitive attributes in NA, the adversary would focus the 
reconstruction process on the records in D* that match all 
t's non-sensitive attributes. The next definition formalizes
this reconstruction. 

Definition 1 (Micro/Aggregate Reconstruction).
A micro group is a set of the records in D that agree on all
attributes in NA. The micro reconstruction seeks to
reconstruct the distribution of SA in a micro group. For a target
individual t, gt denotes the micro group containing t's record
and gt* denotes the set of corresponding records in D*. An
aggregate group is a set of the records in D that agree on 
zero or more but not all attributes in NA. The aggregate 
reconstruction seeks to reconstruct the distribution of SA in 
an aggregate group. 

The intent of distinguishing these two types of reconstruction
is that only micro reconstruction presents a privacy risk;
aggregate reconstruction does not. The next example illustrates
this point.

Example 3. Let NA = {Gender, Job} and SA = Disease.
Consider a target individual t (say Bob) with Gender = Male
and Job = Teacher. The micro group for t, gt, contains
all records in D with Gender = Male and Job = Teacher.
The micro reconstruction for t seeks to reconstruct the
distribution of SA in gt using the published gt*. This
reconstruction is most relevant to t because gt contains all and
only the records in D that match t's non-sensitive information.
In contrast, aggregate reconstruction involves records
that do not match t's information in at least one of Gender
and Job, such as (1) all records for Job = Teacher, or
(2) all records for Gender = Female, or (3) all records for
Gender = Female ∧ Job = Teacher, or (4) all records in D.
These reconstructions are less relevant to t because they are 
based on more records that do not belong to t. For example, 
a high estimated frequency of Breast Cancer in (1) does not 
mean that t has a high chance of getting Breast Cancer be- 
cause most occurrences of Breast Cancer actually come from 
female teachers. 

In the above, we distinguish two types of reconstruction 
based on the set of records in which the data distribution is 
estimated. For each type of reconstruction, we can distin- 
guish two types of estimates based on the records used to 
derive the estimate. In Example 3, we estimate the distribution
of SA in gt based on the records in gt*. Alternatively,
we can treat gt as the difference X − Y of two sets X and
Y, where gt ⊆ X and Y = X − gt, and estimate the distribution
of SA in gt based on the estimates for X and Y. For
example, for the set of male teachers, gt, gt = X − Y, where
X is the set of all teacher records in D and Y is the set of
all female teacher records in D. If FX and FY are the estimated
frequencies of Breast Cancer in X and Y based on X*
and Y*, respectively, we can estimate the frequency of Breast
Cancer in gt by (FX|X| − FY|Y|)/|gt|. The next definition
summarizes these two types of estimation.

Definition 2 (Local/Global Estimates). For any subset
S of D and any SA value x, the local estimate for x wrt
S is based on the information in S*, and a global estimate
for x wrt S is given by (FX|X| − FY|Y|)/|S|, where S ⊆ X,
Y = X − S, and FX and FY are the local estimates of x wrt
X and Y, respectively.

Every local estimate is a global estimate in the special case
of X = S and Y = ∅. At first glance, there is a temptation
for considering global estimates because the use of a superset 
X* is in favor of accurate reconstruction. However, we will 
show that all global estimates are in fact equal to the local 
estimate in Section 4.1. 



Table 1: Notations

Symbols                                   Meaning
m                                         the domain size |SA|
t                                         a target individual
S                                         a subset of records in D
S*                                        the corresponding set of S in D*
g                                         a micro group
g*                                        the corresponding set of g in D*
x                                         a domain value of SA
f                                         the frequency of x in S
O*                                        the variable for the observed count of x in S*
F'                                        the variable for the local estimate of f
$\vec{f}$, $\vec{F'}$, $\vec{O^*}$        the column-vectors of f, F', O*
P                                         the perturbation matrix in Equation (1)
p                                         the retention probability



3.3 Problems 

We are now ready to define the problem we will study. 
We adopt the notation in Table 1 in the rest of the paper.
For each target individual t, the micro reconstruction for t
reconstructs the distribution of SA most relevant to t. If the
distribution is skewed and if the reconstruction is accurate, t's 
SA information will be disclosed. To limit this privacy risk, 
the next definition formalizes a privacy definition through 
bounding the accuracy of micro reconstruction. 

Definition 3 ((ε, δ)-reconstruction-privacy). For a
micro group g, g* is (ε, δ)-reconstruction-private, where ε > 0
and δ ∈ [0, 1], if for each SA value x occurring in g, whenever

$$\Pr\left[\frac{F' - f}{f} > \epsilon\right] \le U \quad \text{or} \quad \Pr\left[\frac{F' - f}{f} < -\epsilon\right] \le L,$$

then δ ≤ min{U, L}, where f is the frequency of x in g and F' is
the variable for a global estimate of f over the random
instances of g*. D* is (ε, δ)-reconstruction-private if g* is
(ε, δ)-reconstruction-private for every micro group g.



Remark 1. (ε, δ)-reconstruction-privacy ensures that the
(best) upper bounds on tail probabilities for micro reconstruction
error greater than ε or smaller than −ε are not smaller
than δ. In this sense, the adversary has difficulty lowering
the probabilities of a large estimation error. The larger the
parameters ε and δ are, the greater this difficulty is and the
more secure the published data is. In this definition, δ is a
constraint on the upper bounds of tail probabilities (i.e., U
and L). This formulation allows us to leverage the extensive
research on upper bounds of tail probabilities in the literature.
Alternatively, δ could be a constraint on the lower bounds of
tail probabilities if such bounds are available, and from
Theorem 2 our approach does not hinge on whether U and L are
upper bounds or lower bounds. In this definition, we consider
the estimate F' estimated from randomized data. In Section
6.3, we will show that the same privacy notion can be applied
to F' estimated from noisy query answers such as those
produced by the differential privacy mechanism.

Definition 4 (The Problem). Given a data set D, a
retention probability p for randomization, ε, and δ, where ε > 0
and δ ∈ [0, 1], we want to produce a randomized version D*
that satisfies (ε, δ)-reconstruction-privacy while information
for aggregate reconstruction is preserved.

Two main problems are to be solved: how to test if
(ε, δ)-reconstruction-privacy is satisfied, and how to achieve
(ε, δ)-reconstruction-privacy on a given data set. We answer the
first question in Section 4 and answer the second question in
Section 5.

4. TESTING PRIVACY 

We first present an estimation technique for F' and then 
present a probabilistic bound for the estimation error of F' . 
In the discussion below, the reader is referred to the notations
in Table 1.

4.1 Maximum Likelihood Estimator 

We adopt the maximum likelihood estimator (MLE) as our
model of local estimates. The next theorem follows from
Theorem 2 in [1].

Theorem 1 (Theorem 2, [2]). For a subset of records
S and any SA value x, F' computed by
$\vec{F'} = P^{-1} \cdot \vec{O^*}/|S|$ is the maximum
likelihood estimator (MLE) of f in S, under the constraint
ΣF' = 1, where the sum is over all elements of $\vec{F'}$.

In the rest of the paper, F' denotes the MLE computed
by Theorem 1. The presence of the matrix inversion P^{-1}
makes it troublesome to compute F' and develop a
probabilistic error bound for F'. The next lemma gives an efficient
computation of F'.

Lemma 1 (Computing F'). For any subset S of D and
any SA value x, (i) E[O*] = |S|(fp + (1 − p)/m), (ii) F' =
(O*/|S| − (1 − p)/m)/p, and (iii) E[F'] = f.

Proof. (i) Let Xk be independent indicator variables for the
event that the k-th row in S* has the SA value x, so that
O* = Σk Xk. From the matrix P in Equation (1), if the k-th
row in S has x, Xk = 1 with probability p + (1 − p)/m, and if
the k-th row in S does not have x, Xk = 1 with probability
(1 − p)/m. So E[O*] = |S|f(p + (1 − p)/m) + |S*|(1 − f)(1 − p)/m =
|S|(fp + (1 − p)/m). This shows (i).

(ii) From Theorem 1, $P \cdot \vec{F'} = \vec{O^*}/|S|$. Let [a]_m denote a
column-vector of the constant a of the length m. We have

$$\frac{\vec{O^*}}{|S|} = P \cdot \vec{F'} = p\vec{F'} + \left[\frac{1-p}{m}\,\Sigma F'\right]_m = p\vec{F'} + \left[\frac{1-p}{m}\right]_m$$

The last equation holds because ΣF' = 1 (Theorem 1).
Thus, O*/|S| = pF' + (1 − p)/m, equivalently,
F' = (O*/|S| − (1 − p)/m)/p, as required for (ii).

(iii) Taking the mean on both sides of F' = (O*/|S| − (1 − p)/m)/p,
we get E[F'] = (E[O*]/|S| − (1 − p)/m)/p. Substituting E[O*] in (i)
into the last equation and simplifying, we get E[F'] = f.
This shows (iii). □
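Lemma 1(ii) can be checked numerically without inverting P. The sketch below, with hypothetical function names and counts, computes F' by the closed form and then applies the perturbation matrix of Equation (1) to it in the form p·F' + ((1 − p)/m)·ΣF', which should reproduce the observed frequencies O*/|S|.

```python
def mle_frequencies(counts, p):
    """Closed form of Lemma 1(ii), applied to every SA value at once."""
    m = len(counts)
    size = sum(counts)                      # |S*| = |S|
    return [(c / size - (1 - p) / m) / p for c in counts]

def apply_perturbation(freqs, p):
    """Compute P * F without building P: p * F_i + ((1 - p) / m) * sum(F)."""
    m = len(freqs)
    total = sum(freqs)
    return [p * f + (1 - p) / m * total for f in freqs]

observed_counts = [50, 30, 20]              # hypothetical counts O* in S*
estimates = mle_frequencies(observed_counts, p=0.5)
round_trip = apply_perturbation(estimates, p=0.5)
```

The estimates sum to 1, as required by the constraint in Theorem 1. Individual components can be negative when an observed frequency falls below (1 − p)/m, which this sketch does not correct.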

From Lemma 1(ii), F' can be computed directly from the
observed count O* without computing the matrix inversion
P^{-1}. From Lemma 1(iii), F' is an unbiased estimator of f.
The next lemma shows that, for the MLE model of local
estimates, all global estimates are equal to the local estimate.

Lemma 2. For any subset S of D and any SA value x, 
every global estimate for x wrt S is equal to the MLE for x 
wrt S. 

Proof. From Definition 2, every global estimate wrt S
has the form (F'X|X| − F'Y|Y|)/|S|, where S ⊆ X ⊆ D and Y =
X − S, and F', F'X, F'Y are the MLEs wrt S, X, Y,
respectively. Let S*, X*, Y* be the sets of records in D*
corresponding to S, X, Y, and let O*, O*X, O*Y be the variables
for the counts of x in S*, X*, Y*, respectively. From Lemma 1(ii),

$$F'_X = \frac{O^*_X/|X| - (1-p)/m}{p} \quad \text{and} \quad F'_Y = \frac{O^*_Y/|Y| - (1-p)/m}{p}$$

Substituting these into (F'X|X| − F'Y|Y|)/|S| and noting
|S| = |X| − |Y| and O* = O*X − O*Y, we get

$$\frac{O^*/|S| - (1-p)/m}{p}$$

which is exactly the MLE F' given by Lemma 1(ii). This shows
that every global estimate for x is equal to the MLE F' for x. □
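A quick numerical check of Lemma 2 is sketched below. The counts and sizes are hypothetical, chosen so that S = X − Y with |S| = |X| − |Y| and O* = O*X − O*Y, and the function names are illustrative.

```python
def mle(count, size, m, p):
    """Local estimate of Lemma 1(ii): (O*/|S| - (1 - p)/m) / p."""
    return (count / size - (1 - p) / m) / p

def global_estimate(count_x, size_x, count_y, size_y, m, p):
    """Global estimate of Definition 2: (F'_X |X| - F'_Y |Y|) / |S|."""
    size_s = size_x - size_y
    f_x = mle(count_x, size_x, m, p)
    f_y = mle(count_y, size_y, m, p)
    return (f_x * size_x - f_y * size_y) / size_s

m, p = 5, 0.7
# Hypothetical observed counts: 40 of 100 records in X*, 25 of 60 in Y*,
# hence 15 of 40 in S* = X* - Y*.
local = mle(15, 40, m, p)
via_superset = global_estimate(40, 100, 25, 60, m, p)
```

Both values agree (0.45 here), so using a larger superset X* brings no extra accuracy for the MLE.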

Consequently, it suffices to consider only local estimates.
The next definition refines Definition 3 by considering only
local estimates and will be used in the remaining discussion
about (ε, δ)-reconstruction-privacy.

Definition 5 ((ε, δ)-reconstruction-privacy (Refined)).
For any micro group g, g* is (ε, δ)-reconstruction-private,
where ε > 0 and δ ∈ [0, 1], if for each SA value x occurring
in g, whenever Pr[(F' − f)/f > ε] ≤ U or Pr[(F' − f)/f < −ε] ≤ L,
then δ ≤ min{U, L}, where f is the frequency of x in g and F'
is the variable for the MLE of f under the constraint ΣF' = 1.

4.2 Probabilistic Error Bounds 

A remaining question is how to bound Pr[(F' − f)/f > ε] and
Pr[(F' − f)/f < −ε]. We leverage tail probabilities of random
variables in the literature to develop such bounds. Recall
that O* is the observed count of an SA value and F' is the
reconstructed frequency of an SA value. The next theorem
gives a conversion between a probabilistic bound for F' and
a probabilistic bound for O*.



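Theorem 2 (Bound Conversion). Consider any subset
S of D and any SA value x. Let μ = E[O*]. For any
upper tail bound function U(θ, μ) and lower tail bound function
L(θ, μ), and for any comparison operator ⊙ (i.e., ≤ or
≥),

1. $\Pr\left[\frac{O^* - \mu}{\mu} > \theta\right] \odot U(\theta, \mu)$ if and only if $\Pr\left[\frac{F' - f}{f} > \epsilon\right] \odot U\left(\frac{\epsilon |S| p f}{\mu}, \mu\right)$

2. $\Pr\left[\frac{O^* - \mu}{\mu} < -\theta\right] \odot L(\theta, \mu)$ if and only if $\Pr\left[\frac{F' - f}{f} < -\epsilon\right] \odot L\left(\frac{\epsilon |S| p f}{\mu}, \mu\right)$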

Proof. We show (1) only because the proof for (2) is similar. From Lemma 1(ii), $F' = \frac{O^*/|S| - (1-p)/m}{p}$, so $O^* = |S|(F'p + (1-p)/m)$, and from Lemma 1(i), $\mu = |S|(fp + (1-p)/m)$. So

$$\frac{O^* - \mu}{\mu} > \theta \;\Leftrightarrow\; |S|\,p\,(F' - f) > \theta\mu \;\Leftrightarrow\; \frac{F' - f}{f} > \frac{\theta\mu}{|S|pf}.$$

This rewriting implies that the probabilities on the two sides of (1) are equal. Then (1) follows because $\theta = \frac{\epsilon |S| p f}{\mu}$. □

From Theorem 2, if we have a tail probability bound for the error of $O^*$ (i.e., $U(\theta, \mu)$ and $L(\theta, \mu)$), we immediately have a tail probability bound for the error of $F'$ (i.e., $U(\frac{\epsilon |S| p f}{\mu}, \mu)$ and $L(\frac{\epsilon |S| p f}{\mu}, \mu)$). Moreover, if the bound for $O^*$ is the best, the corresponding bound for $F'$ is also the best (otherwise, a better bound for $O^*$ could be obtained from Theorem 2). Importantly, the bound conversion does not hinge on the particular form of the bound functions $U$ and $L$. This generality allows us to adapt the best bounds $U$ and $L$ available for $O^*$ to get the best bounds for $F'$.

There is a rich literature on upper bounds for tail probabilities of random variables. Markov's inequality applies to any non-negative random variable and therefore applies to $O^*$. Chebyshev's inequality uses knowledge of the standard deviation to give a tighter bound. However, these bounds are very poor for random variables that fall off exponentially with distance from the mean. The Chernoff bound, due to [5], gives exponential fall-off of probability with distance from the mean. The critical condition needed for the Chernoff bound is that the random variable be a sum of independent Poisson trials.

Theorem 3 (Chernoff Bounds, [5], [15]). Let $X_1, \ldots, X_n$ be independent Poisson trials such that for $1 \le i \le n$, $X_i \in \{0, 1\}$ and $\Pr[X_i = 1] = p_i$, where $0 < p_i < 1$. Let $X = X_1 + \cdots + X_n$ and $\mu = E[X] = E[X_1] + \cdots + E[X_n]$. For $\theta \in (0, \infty)$,

$$\Pr\left[\frac{X - \mu}{\mu} > \theta\right] < U_1(\theta, \mu) = \left(\frac{e^{\theta}}{(1+\theta)^{(1+\theta)}}\right)^{\mu} \tag{2}$$

and for $\theta \in (0, 1]$,

$$\Pr\left[\frac{X - \mu}{\mu} < -\theta\right] < L_1(\theta, \mu) = \left(\frac{e^{-\theta}}{(1-\theta)^{(1-\theta)}}\right)^{\mu} \tag{3}$$

These full Chernoff bounds are quite tight but can be clumsy to compute. Using the Taylor series expansion $\ln(1 + \theta) = \sum_{i \ge 1} (-1)^{i+1} \frac{\theta^i}{i}$ and ignoring higher order terms, the above bounds can be simplified to the following weaker bounds, which cover 95% of cases pretty well: For $\theta \in (0, \infty)$,

$$\Pr\left[\frac{X - \mu}{\mu} > \theta\right] < U_2(\theta, \mu) = \exp\left(-\frac{\theta^2}{2 + \theta}\,\mu\right) \tag{4}$$

and for $\theta \in (0, 1]$,

$$\Pr\left[\frac{X - \mu}{\mu} < -\theta\right] < L_2(\theta, \mu) = \exp\left(-\frac{\theta^2}{2}\,\mu\right) \tag{5}$$
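To get a feel for how the four bounds in Theorem 3 compare, they can be evaluated numerically. The following is a minimal sketch, not part of the original paper; the function names are illustrative, and the bounds are computed in log space to avoid overflow for large $\mu$.

```python
import math


def chernoff_u1(theta: float, mu: float) -> float:
    """Equation (2): bound on Pr[(X - mu)/mu > theta], for theta > 0."""
    # (e^theta / (1+theta)^(1+theta))^mu, evaluated in log space.
    return math.exp(mu * (theta - (1 + theta) * math.log1p(theta)))


def chernoff_l1(theta: float, mu: float) -> float:
    """Equation (3): bound on Pr[(X - mu)/mu < -theta], for 0 < theta <= 1."""
    if theta == 1.0:
        # (1-theta)^(1-theta) tends to 1 as theta tends to 1.
        return math.exp(-mu)
    return math.exp(mu * (-theta - (1 - theta) * math.log1p(-theta)))


def chernoff_u2(theta: float, mu: float) -> float:
    """Equation (4): simplified upper-tail bound, for theta > 0."""
    return math.exp(-theta**2 * mu / (2 + theta))


def chernoff_l2(theta: float, mu: float) -> float:
    """Equation (5): simplified lower-tail bound, for 0 < theta <= 1."""
    return math.exp(-theta**2 * mu / 2)


if __name__ == "__main__":
    mu = 30.0  # hypothetical expected count
    for theta in (0.1, 0.25, 0.5):
        print(theta,
              chernoff_u1(theta, mu), chernoff_u2(theta, mu),
              chernoff_l1(theta, mu), chernoff_l2(theta, mu))
```

For any fixed $\theta$ and $\mu$, the full bounds (2) and (3) are never larger than their simplified counterparts (4) and (5), which is what "weaker" means above.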



The Chernoff bound applies to our variable $O^*$ because $O^*$ is the sum $X_1 + \cdots + X_n$, where each $X_i$ is the indicator variable of whether the $i$-th row in $S^*$ has a particular SA value $x$, and $E[O^*] = |S|(fp + (1-p)/m)$ (Lemma 1). Instantiating the upper bounds $U_i$ and $L_i$ for $O^*$ in Equations (2)-(5) into Theorem 2, the next corollary gives the corresponding upper bounds for $F'$.

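Corollary 1 (Upper bounds for $F'$). Let $U_i$ and $L_i$ be defined in Equations (2)-(5). For $\theta \in (0, \infty)$,

$$\Pr\left[\frac{F' - f}{f} > \epsilon\right] < U_i(\theta, \mu) \tag{6}$$

and for $\theta \in (0, 1]$,

$$\Pr\left[\frac{F' - f}{f} < -\epsilon\right] < L_i(\theta, \mu) \tag{7}$$

where $\theta = \frac{\epsilon |S| p f}{\mu}$ and $\mu = |S|(fp + (1-p)/m)$.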

Corollary 1 gives the concrete upper bounds $U_i$ and $L_i$ on the tail probabilities of $F'$ based on the Chernoff bound. Since these bounds are public, $(\epsilon, \delta)$-reconstruction-privacy implies $\delta \le \min\{U_i, L_i\}$. The question is whether $\delta \le \min\{U_i, L_i\}$ is sufficient for $(\epsilon, \delta)$-reconstruction-privacy, in other words, whether there are tighter (i.e., smaller) upper bounds than $U_i$ and $L_i$. To answer this question, we observe from Theorem 2 that any tighter bound for $F'$ would lead to a tighter bound than the Chernoff bound for $O^*$. The fact that the Chernoff bound has been used as the state-of-the-art technique for the past 60 years suggests that it is nontrivial to improve on it. For this reason, we assume that Corollary 1 gives the best upper bounds for $F'$; however, if better bounds on random variables become available, they can be easily adapted through Theorem 2 to obtain better bounds for $F'$. This observation leads to the following instantiation of $(\epsilon, \delta)$-reconstruction-privacy based on the Chernoff bound.

Corollary 2 (Testing $(\epsilon, \delta)$-reconstruction-privacy). With the upper bounds $U_i$ and $L_i$ in Equations (2)-(5), for a micro group $g$, for $\epsilon \in (0, 1 + \frac{(1-p)/m}{pf}]$ and $\delta \in [0, 1]$, $g^*$ is $(\epsilon, \delta)$-reconstruction-private if and only if, for every SA value in $g$,

$$\delta \le \min\{U_i(\theta, \mu), L_i(\theta, \mu)\} \tag{8}$$

where $\theta = \frac{\epsilon |g| p f}{\mu}$ and $\mu = |g|(fp + (1-p)/m)$.

The range $(0, 1 + \frac{(1-p)/m}{pf}]$ of $\epsilon$ corresponds to the common range $(0, 1]$ of $\theta$ for all of Equations (2)-(5). The condition in Equation (8) can be tested efficiently because all parameters in $\theta$ and $\mu$ are known to the data publisher.
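The test in Corollary 2 can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the authors' code: it instantiates Equation (8) with the simplified bounds $U_2$ and $L_2$ of Equations (4) and (5), and the function name and argument layout are hypothetical.

```python
import math


def is_reconstruction_private(freqs, group_size, p, m, eps, delta):
    """Check Equation (8) with the simplified bounds (4) and (5).

    freqs      -- frequencies f of the SA values occurring in the micro group g
    group_size -- |g|
    p, m       -- retention probability and SA domain size
    """
    for f in freqs:
        w = f * p + (1 - p) / m          # expected fraction of rows showing x
        theta = eps * p * f / w          # equals eps*|g|*p*f / mu
        mu = group_size * w
        if not 0 < theta <= 1:
            raise ValueError("eps is outside the range allowed by Corollary 2")
        u2 = math.exp(-theta**2 * mu / (2 + theta))
        l2 = math.exp(-theta**2 * mu / 2)
        if delta > min(u2, l2):
            return False                 # the bound is too small for this value
    return True


if __name__ == "__main__":
    # Hypothetical group: three SA values, p = 0.5, m = 10, eps = delta = 0.3.
    print(is_reconstruction_private([0.5, 0.3, 0.2], 100, 0.5, 10, 0.3, 0.3))
    print(is_reconstruction_private([0.5, 0.3, 0.2], 200, 0.5, 10, 0.3, 0.3))
```

With these hypothetical parameters the group of 100 records passes the test while the group of 200 records fails it, which illustrates that larger micro groups allow more accurate reconstruction.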



5. ACHIEVING PRIVACY 

We now consider the second major question: how to achieve $(\epsilon, \delta)$-reconstruction-privacy on the published data $D^*$ for a given data set $D$. Corollary 2 gives an efficient condition for testing $(\epsilon, \delta)$-reconstruction-privacy, but it does not provide a clue on how to achieve this condition if it fails. As the first step towards an answer, we rewrite Equation (8) into a constraint on the size $|g|$ of a micro group $g$. Then we present an algorithm to enforce this constraint. Below, we consider $(L_1, U_1)$ and $(L_2, U_2)$ separately.

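Theorem 4. With the upper bounds $U_1(\theta, \mu)$ and $L_1(\theta, \mu)$ in Equations (2) and (3), for a micro group $g$, $\epsilon \in (0, 1 + \frac{(1-p)/m}{pf}]$, and $\delta \in [0, 1]$, $g^*$ is $(\epsilon, \delta)$-reconstruction-private if and only if, for the maximum frequency $f$ of any SA value occurring in $g$,

$$|g| \le \frac{\ln \delta}{w \ln\left[\dfrac{e^{-\theta}}{(1-\theta)^{(1-\theta)}}\right]} \tag{9}$$

where $w = fp + (1-p)/m$ and $\theta = \frac{\epsilon p f}{w}$.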

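Proof. First, we show two claims. Let $X = \frac{e^{\theta}}{(1+\theta)^{(1+\theta)}}$ and $Y = \frac{e^{-\theta}}{(1-\theta)^{(1-\theta)}}$.

Claim 1: for $\theta \in (0, 1]$, $X \ge Y$, thus, $\min\{L_1, U_1\} = L_1$. Note that $X/Y$ approaches 1 as $\theta$ approaches 0. To show the claim, it suffices to show that $X/Y$ is non-decreasing, equivalently, the derivative of $X/Y$ with respect to $\theta$ is non-negative for $\theta \in (0, 1]$. Note

$$\ln\frac{X}{Y} = 2\theta + (1-\theta)\ln(1-\theta) - (1+\theta)\ln(1+\theta).$$

Differentiating both sides with respect to $\theta$, we get

$$\frac{Y}{X}\left(\frac{X}{Y}\right)' = 2 + \left[-\ln(1-\theta) + (1-\theta)\frac{-1}{1-\theta}\right] - \left[\ln(1+\theta) + (1+\theta)\frac{1}{1+\theta}\right]$$

and

$$\left(\frac{X}{Y}\right)' = -\frac{X}{Y}\ln(1-\theta^2) \ge 0.$$

The last inequality follows because $X$ and $Y$ are non-negative and $\theta$ is in $(0, 1]$. This shows Claim 1.

Claim 2: for $\theta \in (0, 1]$, $Y$ is in $(0, 1)$ and is non-increasing. We show that the derivative of $Y$ is non-positive (thus, $Y$ is non-increasing) for $\theta \in (0, 1]$. Then the claim follows from the fact that $Y$ approaches 1 as $\theta$ approaches 0.

$$\ln Y = -\theta - (1-\theta)\ln(1-\theta)$$

Differentiating both sides with respect to $\theta$ gives

$$Y' = Y\left[-1 - \left(-\ln(1-\theta) + (1-\theta)\frac{-1}{1-\theta}\right)\right] = Y\ln(1-\theta) \le 0.$$

The last inequality follows because $Y$ is non-negative and $\theta$ is in $(0, 1]$. This shows Claim 2.

From Claim 1, $L_1(\theta, \mu) \le U_1(\theta, \mu)$, so Equation (8) degenerates into $\delta \le L_1(\theta, \mu) = Y^{\mu}$, and $\ln \delta \le \mu \ln Y = |g|\,w \ln Y$. From Claim 2, $Y$ is in $(0, 1)$, so $\ln Y < 0$, and $|g| \le \frac{\ln \delta}{w \ln Y}$. As $f$ increases, $w$ and $\theta = \frac{\epsilon p f}{w}$ increase, and from Claim 2, $Y$ is in $(0, 1)$ and is non-increasing, thus, $\ln Y$ is decreasing. Since both $\ln \delta$ and $\ln Y$ are negative, $\frac{\ln \delta}{w \ln Y}$ is minimized when $f$ is maximized. So Equation (8) degenerates into Equation (9). □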



Observe that the right-hand side of the condition in Equation (9) is a constant if the maximum frequency $f$ is kept unchanged. The idea of our algorithm to enforce this condition is reducing $|g|$ while keeping $f$ unchanged. The next theorem gives a similar rewriting based on the bounds $L_2$ and $U_2$ in Equations (4) and (5).

Theorem 5. With the upper bounds $U_2(\theta, \mu)$ and $L_2(\theta, \mu)$ in Equations (4) and (5), for a micro group $g$, $\epsilon \in (0, 1 + \frac{(1-p)/m}{pf}]$, and $\delta \in [0, 1]$, $g^*$ is $(\epsilon, \delta)$-reconstruction-private if and only if, for the maximum frequency $f$ of any SA value occurring in $g$,

$$|g| \le \frac{-2\ln \delta}{w\,\theta^2} \tag{10}$$

where $w = fp + (1-p)/m$ and $\theta = \frac{\epsilon p f}{w}$.

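Proof. For $\theta > 0$, $L_2(\theta, \mu) < U_2(\theta, \mu)$, so Equation (8) degenerates into $\delta \le L_2(\theta, \mu)$, where $\theta = \frac{\epsilon p f}{w}$ and $\mu = |g|\,w$. Note $\theta = \frac{\epsilon p f}{w} = \frac{\epsilon p}{p + \frac{1-p}{mf}}$. As $f$ increases, $\theta$ and $\mu = |g|\,w$ increase, hence, $L_2(\theta, \mu) = \exp(-\frac{\theta^2}{2}\mu)$ decreases. Therefore, it suffices to consider the maximum frequency $f$ in $g$ for checking $\delta \le L_2$. The rest of the proof follows from the following rewriting:

$$\delta \le \exp\left(-\frac{\theta^2}{2}\mu\right) \;\Leftrightarrow\; \mu \le \frac{-2\ln \delta}{\theta^2} \;\Leftrightarrow\; |g| \le \frac{-2\ln \delta}{w\,\theta^2}.$$

□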
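The size constraints in Equations (9) and (10) are easy to evaluate for a given micro group. The following is a minimal sketch under the definitions $w = fp + (1-p)/m$ and $\theta = \epsilon p f / w$ used above; the function names are illustrative, and $\delta$ is restricted to $(0, 1]$ so that $\ln \delta$ is defined.

```python
import math


def _w_theta(f_max, p, m, eps):
    """Return (w, theta) for the maximum SA frequency f_max in the group."""
    w = f_max * p + (1 - p) / m
    theta = eps * p * f_max / w
    if not 0 < theta <= 1:
        raise ValueError("eps is outside the range required by Theorems 4 and 5")
    return w, theta


def max_size_full(f_max, p, m, eps, delta):
    """Right-hand side of Equation (9), based on the full bound L1."""
    w, theta = _w_theta(f_max, p, m, eps)
    # ln Y = -theta - (1-theta) ln(1-theta); the limit at theta = 1 is -1.
    ln_y = -1.0 if theta == 1.0 else -theta - (1 - theta) * math.log1p(-theta)
    return math.log(delta) / (w * ln_y)


def max_size_simplified(f_max, p, m, eps, delta):
    """Right-hand side of Equation (10), based on the simplified bound L2."""
    w, theta = _w_theta(f_max, p, m, eps)
    return -2 * math.log(delta) / (w * theta**2)


if __name__ == "__main__":
    # Hypothetical setting: f = 0.5, p = 0.5, m = 10, eps = delta = 0.3.
    print(max_size_full(0.5, 0.5, 10, 0.3, 0.3))
    print(max_size_simplified(0.5, 0.5, 10, 0.3, 0.3))
```

Because $L_1 \le L_2$, the threshold from Equation (9) is never larger than the one from Equation (10); in the hypothetical setting above the two values are roughly 117 and 128 records.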



In the rest of this section, we develop an algorithm for achieving $(\epsilon, \delta)$-reconstruction-privacy based on Theorem 5, but a similar algorithm can be developed based on Theorem 4. According to Theorem 5, if $|g| \le s_g$ fails, where $s_g = \frac{-2\ln \delta}{w\,\theta^2}$, $g^*$ is not $(\epsilon, \delta)$-reconstruction-private. There are several options to restore this inequality. One option is increasing $s_g$ by reducing either the retention probability $p$ or the maximum frequency $f$ in $g$. Another option is decreasing $|g|$ by discarding some records. None of these options is desirable because they either make the data set more random or distort the global data distribution.

Our observation is that $|g|$ in Equation (10) really refers to the number of independent Poisson trials in the randomization process for generating $g^*$. This can be seen from $\mu = E[O^*] = |g|(fp + (1-p)/m)$ (Lemma 1(i)), where $|g|$ is the number of indicator variables $X_k$ for the event that the $k$-th row in $g^*$ has a particular SA value $x$ (see the proof of Lemma 1). Since the upper bounds in Equations (2)-(5) decrease exponentially in $\mu$, reducing $|g|$ is highly effective in increasing these upper bounds, which helps restore the inequality in Equation (10), provided that the frequency $f$ remains unchanged. At the same time, we want to preserve the frequency of each SA value to minimize the distortion to the data distribution. To meet both requirements, we shall randomize a sample $g_1$ of $g$ and scale the randomized data $g_1^*$ back to the original size $|g|$. The key is to preserve the frequency of each SA value in both sampling and scaling operations. This task is performed by the following three functions. Assume $|g| > s_g$.

1. Sampling($g$, $s_g$): this function takes a sample of the size $s_g$ from $g$ such that the number of records for each SA value is reduced by the same fraction. Let $b = s_g/|g|$ (note $b < 1$). For each SA value $x$ occurring in $g$, let $g_1$ contain any $\lfloor |g_x| b \rfloor$ records from $g_x$ and one additional record from $g_x$ with probability $|g_x| b - \lfloor |g_x| b \rfloor$, where $g_x$ denotes the set of records in $g$ for $x$. Note that all records in $g_x$ are identical. Return $g_1$. This step reduces the number of independent trials to $s_g$ while preserving the frequency of each SA value.

2. Perturbing($g_1$, $p$, $m$): this function randomizes the SA values of the records in $g_1$ as described in Section 3.1 and returns the randomized $g_1^*$.

3. Scaling($g_1^*$, $|g|$): this function scales up $g_1^*$ to the original size $|g|$ while preserving the frequency of each SA value. Let $b' = |g|/|g_1^*|$. For each record $r^*$ in $g_1^*$, let $g_2^*$ contain $\lfloor b' \rfloor$ duplicates of $r^*$ and one additional duplicate of $r^*$ with probability $b' - \lfloor b' \rfloor$. Return $g_2^*$. Note that the duplication does not increase the number of independent trials because all duplicates of $r^*$ originate from the same independent trial for $r^*$.

The algorithm based on the above idea is described in Algorithm 1. The input consists of $D, p, m, \epsilon, \delta$ and the output is $D_2^*$. For each micro group $g$, if $|g| \le s_g$, $g_2^*$ is equal to $g^*$. Otherwise, $g_2^*$ is produced by the three steps on Lines 7-9 described above. $D_2^*$ contains all $g_2^*$.

Example 4. Suppose that a micro group $g$ contains 5 records for $x_1$ and 15 records for $x_2$: $|g| = 20$, $|g_{x_1}| = 5$, $|g_{x_2}| = 15$. Assume $s_g = 15$. Since $|g| > s_g$, Sampling($g$, $s_g$) produces a sample $g_1$ of $g$ as follows. $b = s_g/|g| = 0.75$. $g_1$ contains $\lfloor 5 \times 0.75 \rfloor = 3$ records from $g_{x_1}$ and one additional record from $g_{x_1}$ with probability $5 \times 0.75 - 3 = 75\%$; $g_1$ contains $\lfloor 15 \times 0.75 \rfloor = 11$ records from $g_{x_2}$ and one additional record from $g_{x_2}$ with probability $15 \times 0.75 - 11 = 25\%$. Suppose that after coin flips, $g_1$ contains 4 records from $g_{x_1}$ and 11 records from $g_{x_2}$. Perturbing($g_1$, $p$, $m$) produces the randomized version of $g_1$, $g_1^*$.

Scaling($g_1^*$, $|g|$) scales up $g_1^*$ to the size $|g|$ as follows. $b' = |g|/|g_1^*| = 20/15 = 1.33$. For each record $r^*$ in $g_1^*$, $g_2^*$ contains $\lfloor b' \rfloor = 1$ duplicate of $r^*$ and contains one additional duplicate with probability $1.33 - 1 = 33\%$. Suppose that after coin flips, one additional duplicate for $x_1$ is chosen, and four additional duplicates for $x_2$ are chosen. So $g_2^*$ contains 5 records for $x_1$ and 15 records for $x_2$. In general, $|g_2^*|$ may not be exactly equal to $|g|$.
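The three functions and the per-group logic of Algorithm 1 can be sketched as follows. This is an illustrative sketch, not the authors' implementation. It makes two assumptions that are not spelled out in this section: a micro group is represented simply as a list of SA values, and the perturbation of Section 3.1 keeps a value with probability $p$ and otherwise replaces it with a value drawn uniformly from the SA domain (which is consistent with $E[O^*] = |g|(fp + (1-p)/m)$).

```python
import math
import random
from collections import Counter


def sampling(group, s_g, rng):
    """Keep about s_g records, shrinking every SA value by the same factor b."""
    b = s_g / len(group)
    sample = []
    for x, count in Counter(group).items():
        k = math.floor(count * b)
        if rng.random() < count * b - k:   # one extra record with the fractional part
            k += 1
        sample.extend([x] * k)
    return sample


def perturbing(group, p, domain, rng):
    """Assumed uniform perturbation: retain with probability p, else draw uniformly."""
    return [x if rng.random() < p else rng.choice(domain) for x in group]


def scaling(perturbed, target_size, rng):
    """Duplicate each randomized record about b' = target_size / len(perturbed) times."""
    if not perturbed:
        return []
    b = target_size / len(perturbed)
    out = []
    for r in perturbed:
        k = math.floor(b)
        if rng.random() < b - k:
            k += 1
        out.extend([r] * k)
    return out


def randomize_group(group, p, domain, eps, delta, rng):
    """Per-group body of Algorithm 1, with s_g taken from Equation (10)."""
    f_max = max(Counter(group).values()) / len(group)
    w = f_max * p + (1 - p) / len(domain)
    theta = eps * p * f_max / w
    s_g = -2 * math.log(delta) / (w * theta**2)
    if len(group) <= s_g:
        return perturbing(group, p, domain, rng)
    g1 = sampling(group, s_g, rng)
    return scaling(perturbing(g1, p, domain, rng), len(group), rng)


if __name__ == "__main__":
    rng = random.Random(0)
    g = ["x1"] * 5 + ["x2"] * 15           # the micro group of Example 4
    g1 = sampling(g, 15, rng)              # s_g = 15 as assumed in Example 4
    print(Counter(g1))
    print(len(scaling(perturbing(g1, 0.5, ["x1", "x2", "x3"], rng), 20, rng)))
```

As in Example 4, the sample contains 3 or 4 records for $x_1$ and 11 or 12 records for $x_2$, depending on the coin flips, and the scaled group has a size close to, but not necessarily equal to, $|g| = 20$.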

We show that $D_2^*$ produced by Algorithm 1 satisfies some interesting properties with respect to privacy and utility. Consider a micro group $g$ such that $|g| > s_g$. Let $g_1, g_1^*, g_2^*$ be computed for $g$ in Algorithm 1 and let $O_g^*, O_{g_1}^*, O_{g_2}^*$ be the observed counts of a particular SA value $x$ in $g^*, g_1^*, g_2^*$. Let $f_g$ and $f_{g_1}$ be the frequencies of $x$ in $g$ and $g_1$. Let $F_g, F_{g_1}, F_{g_2}$ be the MLEs reconstructed from $g^*, g_1^*, g_2^*$. $u \simeq v$ denotes that $u$ and $v$ are equal modulo the coin flips in Scaling and Sampling. It is easy to see a few simple facts:

• Fact 1: $f_{g_1} \simeq f_g$, that is, Sampling preserves the frequency of $x$ in $g$. This is because the count of every $x$ in $g$ is reduced by the same factor $b$ modulo the coin flips.

• Fact 2: $O_{g_2}^*/|g_2^*| \simeq O_{g_1}^*/|g_1^*|$, that is, Scaling preserves the frequency of $x$ in $g_1^*$. This is because each record in $g_1^*$ is duplicated $b'$ times modulo the coin flips.



Algorithm 1 Achieving Reconstruction Privacy
Input: $D, p, m, \epsilon, \delta$
Output: Randomized $D_2^*$ that is $(\epsilon, \delta)$-reconstruction-private
 1: $D_2^* \leftarrow \emptyset$
 2: for all micro groups $g$ in $D$ do
 3:    compute $s_g = \frac{-2\ln \delta}{w\,\theta^2}$ using Equation (10)
 4:    if $|g| \le s_g$ then
 5:       $g_2^* \leftarrow$ Perturbing($g, p, m$)
 6:    else
 7:       $g_1 \leftarrow$ Sampling($g, s_g$)
 8:       $g_1^* \leftarrow$ Perturbing($g_1, p, m$)
 9:       $g_2^* \leftarrow$ Scaling($g_1^*, |g|$)
10:    add $g_2^*$ to $D_2^*$
11: return $D_2^*$

Sampling($g, s_g$):
 1: temp $\leftarrow \emptyset$
 2: $b \leftarrow s_g/|g|$
 3: for all SA values $x$ occurring in $g$ do
 4:    $g_x \leftarrow$ the set of records in $g$ having $x$
 5:    add to temp any $\lfloor |g_x| b \rfloor$ records from $g_x$
 6:    add to temp one additional record from $g_x$ with probability $|g_x| b - \lfloor |g_x| b \rfloor$
 7: return temp

Perturbing($g_1, p, m$):
 1: temp $\leftarrow \emptyset$
 2: for all records $r$ in $g_1$ do
 3:    let $r^*$ be $r$ with SA perturbed with retention probability $p$
 4:    add $r^*$ to temp
 5: return temp

Scaling($g_1^*, |g|$):
 1: $b' \leftarrow |g|/|g_1^*|$
 2: temp $\leftarrow \emptyset$
 3: for all records $r^*$ in $g_1^*$ do
 4:    add to temp $\lfloor b' \rfloor$ duplicates of $r^*$
 5:    add to temp one additional duplicate of $r^*$ with probability $b' - \lfloor b' \rfloor$
 6: return temp



• Fact 3: $F_{g_1} \simeq F_{g_2}$, that is, $g_1^*$ and $g_2^*$ give the same estimate of $f_g$. This follows from $F_{g_j} = \frac{O_{g_j}^*/|g_j^*| - (1-p)/m}{p}$, $j = 1, 2$ (Lemma 1(ii)) and Fact 2.

• Fact 4: $E[O_{g_2}^*] \simeq E[O_g^*]$ and $|g^*| \simeq |g_2^*|$. $|g^*| \simeq |g_2^*|$ follows from Sampling and Scaling. From Lemma 1(i), $E[O_g^*] = |g|(f_g p + (1-p)/m)$ and $E[O_{g_1}^*] = s_g(f_{g_1} p + (1-p)/m)$. Since Scaling duplicates each $x$ occurrence in $g_1^*$ $\frac{|g|}{s_g}$ times, $E[O_{g_2}^*] \simeq \frac{|g|}{s_g} E[O_{g_1}^*] = |g|(f_{g_1} p + (1-p)/m)$. Then Fact 1 implies $E[O_{g_2}^*] \simeq E[O_g^*]$.

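Theorem 6 (Privacy). For each micro group $g$, $g_2^*$ is $(\epsilon, \delta)$-reconstruction-private.

Proof. If $|g| \le s_g$, $g_2^*$ is $(\epsilon, \delta)$-reconstruction-private (Theorem 5). We assume $|g| > s_g$. $g_1^*$ is $(\epsilon, \delta)$-reconstruction-private because $|g_1| \simeq s_{g_1}$ (Theorem 5). $|g_1| \simeq s_{g_1}$ follows because $f_{g_1} \simeq f_g$ (Fact 1) implies $s_g \simeq s_{g_1}$, and from $|g_1| \simeq s_g$. Facts 1 and 3 imply $\frac{F_{g_2} - f_g}{f_g} \simeq \frac{F_{g_1} - f_{g_1}}{f_{g_1}}$. So,

$$\Pr\left[\frac{F_{g_2} - f_g}{f_g} > \epsilon\right] \simeq \Pr\left[\frac{F_{g_1} - f_{g_1}}{f_{g_1}} > \epsilon\right], \quad\text{and}\quad \Pr\left[\frac{F_{g_2} - f_g}{f_g} < -\epsilon\right] \simeq \Pr\left[\frac{F_{g_1} - f_{g_1}}{f_{g_1}} < -\epsilon\right].$$

Since $g_1^*$ is $(\epsilon, \delta)$-reconstruction-private, so is $g_2^*$. □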

Below, we show that $F'_2$ has the same mean as $F'$. Let $S$ be any set of micro groups, and let $S^*$ and $S_2^*$ be the sets of corresponding records in $D^*$ and $D_2^*$, respectively. For any SA value $x$, let $F'_2$ denote the estimated frequency of $x$ in $S$ based on $S_2^*$ and let $F'$ denote the estimated frequency of $x$ in $S$ based on $S^*$.

Theorem 7 (Utility). $E[F'_2] \simeq E[F']$.

Proof. Let $O_2^* = \sum_{g \in S} O_{g_2}^*$ and $O^* = \sum_{g \in S} O_g^*$. Let $|S^*| = \sum_{g \in S} |g^*|$ and $|S_2^*| = \sum_{g \in S} |g_2^*|$. From Lemma 1(ii),

$$E[F'] = \frac{E[O^*]/|S^*| - (1-p)/m}{p}, \qquad E[F'_2] = \frac{E[O_2^*]/|S_2^*| - (1-p)/m}{p}.$$

From Fact 4, $|S^*| \simeq |S_2^*|$ and $E[O^*] \simeq E[O_2^*]$, which implies $E[F'_2] \simeq E[F']$. □

Despite $E[F'_2] \simeq E[F']$, $F'_2$ will have a larger error than $F'$ due to the reduced number of independent trials for $S_2^*$. This is exactly what we want in order to restore $(\epsilon, \delta)$-reconstruction-privacy. However, the error increase for aggregate reconstruction is smaller than that for micro reconstruction because aggregate reconstruction involves more than one micro group. We will evaluate this claim empirically in Section 6.
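The effect described by Theorem 7 and the paragraph above (same mean, larger spread) can be illustrated with a small Monte Carlo sketch. This is not an experiment from the paper. It assumes the uniform perturbation model implied by Lemma 1 (a row whose true value is $x$ shows $x$ with probability $p + (1-p)/m$, any other row with probability $(1-p)/m$), and it ignores the coin flips of Sampling and Scaling by scaling the observed count deterministically; all parameter values are hypothetical.

```python
import random
import statistics


def mle(observed, n, p, m):
    """MLE of the frequency from an observed count (Lemma 1(ii))."""
    return (observed / n - (1 - p) / m) / p


def simulate(n_x, n_other, scale_to, p, m, runs, rng):
    """Randomize n_x + n_other rows, scale the count to scale_to rows, estimate f."""
    keep = p + (1 - p) / m      # a row with true value x still shows x
    flip = (1 - p) / m          # a row with another value shows x
    trials = n_x + n_other
    estimates = []
    for _ in range(runs):
        hits = sum(rng.random() < keep for _ in range(n_x))
        hits += sum(rng.random() < flip for _ in range(n_other))
        estimates.append(mle(hits * scale_to / trials, scale_to, p, m))
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    rng = random.Random(1)
    # |g| = 400 with f = 0.5; the sampled group keeps 100 independent trials.
    print("all trials:    ", simulate(200, 200, 400, 0.5, 10, 2000, rng))
    print("sampled+scaled:", simulate(50, 50, 400, 0.5, 10, 2000, rng))
```

Both configurations give a mean estimate close to the true frequency 0.5, while the configuration with fewer independent trials has a visibly larger standard deviation.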

6. EMPIRICAL EVALUATION 

This empirical study aims to answer two questions: The first question is "to what extent is $(\epsilon, \delta)$-reconstruction-privacy violated assuming that major privacy definitions are satisfied?". The second question is "what price will be paid for having $(\epsilon, \delta)$-reconstruction-privacy?" Section 6.1 introduces our data sets and utility metrics. Section 6.2 presents the findings in the data publishing setting and Section 6.3 presents the findings in the output perturbation setting.

6.1 Experimental Setup

Data Sets. We utilize the real CENSUS data containing personal information of 500K American adults, previously used in [21], [14], and [4]. Table 2 shows the 7 discrete attributes of the data. Two base tables were generated from CENSUS. OCC denotes the base table with Occupation as the sensitive attribute (SA) and the remaining attributes as the non-sensitive attributes (NA). EDU denotes the base table with Education as the sensitive attribute (SA) and the remaining attributes as the non-sensitive attributes (NA). OCC-n and EDU-n denote the samples of cardinality n, where n = 100K, 200K, 300K, 400K, 500K. Figure 1 shows the frequency distribution of SA for OCC-300K and EDU-300K. EDU-300K has a more skewed distribution than OCC-300K.

Table 2: Number of Values in Attributes

| Attribute  | Domain Size |
|------------|-------------|
| Age        | 77          |
| Gender     | 2           |
| Education  | 14          |
| Marital    | 6           |
| Race       | 9           |
| Work-class | 7           |
| Occupation | 50          |

Count Queries. We evaluate the utility of data analysis through count queries of the following form

SELECT COUNT(*) FROM D
WHERE $A_1 = a_1 \wedge \cdots \wedge A_d = a_d \wedge SA = x_i$     (11)

where $\{A_1, \ldots, A_d\}$ is a subset of non-sensitive attributes, $a_j$ is a value from the domain of $A_j$, $j = 1, \ldots, d$, and $x_i$ is a value from the domain of SA. The answer to the query, denoted by ans, is the count of records in D that satisfy the predicate in the WHERE clause. Since our primary interest is in aggregate information, we consider only queries that have at least 0.1% selectivity, where the selectivity is defined as ans/|D|. This means that d is restricted to be 1, 2, or 3 because a query for any larger d has a selectivity less than 0.1%. We generate a pool of 5,000 queries as follows. For each query, we randomly select d from {1, 2, 3} with equal probability and randomly select d non-sensitive attributes without replacement. For each attribute $A_i$ selected, we randomly choose a value $a_i$ from the domain of $A_i$. Finally, we randomly choose a value $x_i$ from the domain of SA and create a query following the template in Equation (11). If the query has a selectivity of 0.1% or more, we add it to the pool. This process is repeated until the pool contains 5,000 queries.

Table 3: Parameter Table

| Parameter  | Settings                       |
|------------|--------------------------------|
| $p$        | 0.1, 0.3, 0.5, 0.7, 0.9        |
| $\epsilon$ | 0.1, 0.3, 0.5, 0.7, 0.9        |
| $\delta$   | 0.14, 0.22, 0.3, 0.38, 0.46    |
| $\lvert D \rvert$ | 100K, 200K, 300K, 400K, 500K |

Figure 1: Frequency Distribution for SA (panels: SA of EDU-300K; SA of OCC-300K)



6.2 Findings in Data Publishing 

In the data publishing scenario, the randomized data $D^*$ is published and a query is answered using $D^*$ and a reconstruction process as described in Section 3.1. Our study focuses on two questions: (i) To what extent is $(\epsilon, \delta)$-reconstruction-privacy violated on $D^*$? (ii) What additional price is incurred for the protection of $(\epsilon, \delta)$-reconstruction-privacy? To answer the first question, we study the percentage of violating micro groups in $D^*$ that fail to satisfy $(\epsilon, \delta)$-reconstruction-privacy. To answer the second question, we measure the (average) relative error for answering the count queries in our query pool, defined as $\frac{|est - ans|}{ans}$, where ans is the true answer and est is the estimated answer. We compare the relative error generated using $D_2^*$ produced by Algorithm 1, denoted by RP (for reconstruction privacy), with the relative error generated using $D^*$ produced by the standard uniform perturbation, denoted by UP. Both methods use a retention probability $p$ to randomize the data and thus ensure some uncertainty of the SA value in a record, such as $\rho_1$-$\rho_2$ privacy. According to [SI |3] [4], the maximum $p$ for providing $\rho_1$-$\rho_2$ privacy is $p = \frac{\gamma - 1}{m - 1 + \gamma}$, where $\gamma = \frac{\rho_2}{\rho_1} \times \frac{1 - \rho_1}{1 - \rho_2}$ and $m = |SA|$. Additionally, RP has the parameters $\epsilon$ and $\delta$ for $(\epsilon, \delta)$-reconstruction-privacy. We consider the settings of $p$, $\epsilon$, $\delta$, and $|D|$ shown in Table 3. The default settings are in boldface.

Figure 2: EDU: % of Violating Micro Groups in $D^*$

Figure 3: EDU: Comparison of Relative Error for Count Queries (panels: (a) vs. $p$, (b) vs. $\epsilon$, (c) vs. $\delta$, (d) vs. $|D|$)

6.2.1 Findings on EDU Data Sets 

Figure 2 shows the percentage of violating micro groups in $D^*$ vs. $p$, $\epsilon$, $\delta$, and $|D|$. Here are several observations. Firstly, there is a nontrivial percentage of micro groups that violate $(\epsilon, \delta)$-reconstruction-privacy. A larger retention probability $p$ leads to more violating micro groups. The violation diminishes when $p$ becomes very small (i.e., less than 20%), but in this case aggregate reconstruction is affected significantly because $D^*$ is too noisy, as shown by the larger relative error. A larger $\epsilon$ or $\delta$ leads to more violating micro groups due to a more restrictive privacy constraint. A larger data cardinality $|D|$ leads to more violating micro groups. This is because a larger $|D|$ leads to a larger $|g|$, i.e., more independent trials when generating $g^*$, and thus a more accurate reconstruction. In fact, $|g| \le \frac{-2\ln \delta}{w\,\theta^2}$ is more likely to be violated as $|g|$ increases.

For each experiment in Figure 2, Figure 3 shows the relative error of UP and RP. Note that, in Figure 3(b)(c), UP remains constant because UP does not depend on $\epsilon$ and $\delta$. The most significant finding is that, across all of $p$, $\epsilon$, $\delta$, and $|D|$, the error of RP is only slightly more than the error of UP. This point can also be seen by cross-examining Figure 2 and Figure 3: the increase of error for RP is much slower than the increase in the percentage of violating micro groups. The reason is that the error boosting of RP through reducing the number of independent trials has less effect on queries that involve a large set of records. This finding supports our claim that the proposed method does not compromise the utility of aggregate information. For $p$ and $|D|$, the trends in Figure 2 and Figure 3 are opposite: as $p$ or $|D|$ increases, the percentage of violating micro groups increases, but the error of estimated query answers decreases. This makes sense because violating micro groups are caused by high accuracy of estimated query answers.

6.2.2 Findings on OCC Data Sets 

We performed a similar study on the more balanced OCC data sets. Figure 4 shows the percentage of violating micro groups and Figure 5 shows the relative error. As we can see, the findings are quite similar to those for the EDU data sets.

6.3 Findings on Output Perturbation 

Although Definition 3 is based on reconstruction from randomized data $D^*$, the notion of $(\epsilon, \delta)$-reconstruction-privacy is applicable to any reconstruction. In this experiment, we consider reconstruction from noisy query answers in the output perturbation scenario. We assume that differential privacy Jj is in place. The $\lambda$-differential privacy mechanism adds random noise $\xi$ to the query answer $o$ and publishes the noisy answer $o' = o + \xi$, where $\xi$ follows the Laplace distribution $Lap(b) = \frac{1}{2b}\exp(-\frac{|\xi|}{b})$, $b = 1/\lambda$. $\lambda$ determines the noise level. We show that even if differential privacy is satisfied, there is a concern about violation of $(\epsilon, \delta)$-reconstruction-privacy. We use EDU-500K and OCC-500K.

Figure 4: OCC: % of Violating Micro Groups in $D^*$

Figure 5: OCC: Comparison of Relative Error for Count Queries (panels: (a) vs. $p$, (b) vs. $\epsilon$, (c) vs. $\delta$, (d) vs. $|D|$)

For each data set, we pick 7 micro groups $g$ that have the largest maximum frequency $f$ of any SA value, among those with $|g| > 70$ for EDU-500K and $|g| > 100$ for OCC-500K. For each of these groups $g$, let $f$ and $F'$ be the true and estimated frequencies of the most frequent SA value $x$ in $g$. $F'$ is computed from the noisy answers to two queries $Q_1$ and $Q_2$ constructed similarly to those in Example 2. Let $o_i$ be the true answer and let $o'_i$ be the noisy answer for $Q_i$, $i = 1, 2$. $f = o_2/o_1$ and $F' = o'_2/o'_1$. By treating $\Pr[\frac{F' - f}{f} > \epsilon]$ and $\Pr[\frac{F' - f}{f} < -\epsilon]$ as the upper bounds of these probabilities themselves, $(\epsilon, \delta)$-reconstruction-privacy is violated if $\Pr[\frac{F' - f}{f} > \epsilon] < \delta$ or $\Pr[\frac{F' - f}{f} < -\epsilon] < \delta$. To compute these probabilities, we generated the noisy answers $o'_1$ and $o'_2$ 100 times and considered the fraction of the cases $\frac{F' - f}{f} > \epsilon$ and $\frac{F' - f}{f} < -\epsilon$. The numbers for these cases are in Tables 4 and 5.
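The counting procedure just described can be sketched as follows. This is an illustration rather than the authors' experimental code: the true answers `o1` and `o2` are hypothetical values, and Laplace noise with scale $b = 1/\lambda$ is drawn as the difference of two exponential variates, which is a standard way to sample it with the Python standard library.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def count_large_errors(o1, o2, lam, eps, trials, rng):
    """Count noisy estimates F' = o2'/o1' whose relative error exceeds eps.

    Returns (cases with (F'-f)/f > eps, cases with (F'-f)/f < -eps).
    """
    f = o2 / o1
    b = 1 / lam
    above = below = 0
    for _ in range(trials):
        estimate = (o2 + laplace(b, rng)) / (o1 + laplace(b, rng))
        error = (estimate - f) / f
        if error > eps:
            above += 1
        elif error < -eps:
            below += 1
    return above, below


if __name__ == "__main__":
    rng = random.Random(7)
    # Hypothetical micro group: 100 records, 72 of which have the value x.
    for lam in (0.1, 0.05):
        for eps in (0.2, 0.3):
            print(lam, eps, count_large_errors(100, 72, lam, eps, 100, rng))
```

Dividing each count by the number of trials gives the empirical tail probabilities that are compared with $\delta$.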

Take Group 7 in Table 4 (in boldface) for EDU-500K as an example. For $\lambda = 0.1$ and $\epsilon = 0.3$, there are 8 cases for $> \epsilon$ and 8 cases for $< -\epsilon$. Intuitively, this says that, out of the 100 noisy answers $(o'_1, o'_2)$ examined, 8 cases have an error greater than 30% and 8 cases have an error less than −30%. In other words, the estimate $F'$ falls within the ±30% interval with a confidence level of 84%. The privacy concern comes from the fact that the frequency of $x$ in $g$ is more than 70% (shown in the column "$f$ in $g$"), which is significantly higher than the 2.5% in the whole data set $D$ (shown in the column "$f$ in $D$"). Thus, even if the ±30% interval is large, $F'$ discloses a much higher probability of having $x$ for the individuals in $g$ than for the individuals in $D$. Similar disclosures are observed on the more balanced OCC-500K. For $\lambda = 0.1$ and $\epsilon = 0.2$, Group 2 in Table 5 (in boldface) shows a ±20% error interval with a confidence level of 84%. Although the frequency $f$ of $x$ in this group is only 47%, it is significantly higher than the frequency of 2.4% in the whole data set. Therefore, $F'$ discloses quite a bit about the SA value of the individuals in this group.

At $\lambda = 0.05$, a larger error for $F'$ has been observed due to the increased noise level. However, since $\lambda$ is a constant for a given $\lambda$-differential privacy mechanism, the error for $F'$ can be reduced by a sufficiently large group size $|g|$ and frequency $f$ in $g$. To provide $(\epsilon, \delta)$-reconstruction-privacy, the $\lambda$-differential privacy mechanism has to employ a very small $\lambda$. This solution shares the same drawback as the solution of using a small retention probability $p$: the global noise parameters, i.e., $\lambda$ and $p$, are chosen according to the worst case of any micro group in the data set. As discussed in Section 6.2.1, this type of solution destroys both micro reconstruction and aggregate reconstruction, making the data useless for all queries.

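Table 4: EDU-500K: the Number of Cases for $\frac{F' - f}{f} > \epsilon$ and $\frac{F' - f}{f} < -\epsilon$ (columns 5-8: $\lambda = 0.1$; columns 9-12: $\lambda = 0.05$)

| Micro Group $g$ | $\lvert g \rvert$ | $f$ in $g$ | $f$ in $D$ | $\epsilon$=0.2, $> \epsilon$ | $\epsilon$=0.2, $< -\epsilon$ | $\epsilon$=0.3, $> \epsilon$ | $\epsilon$=0.3, $< -\epsilon$ | $\epsilon$=0.2, $> \epsilon$ | $\epsilon$=0.2, $< -\epsilon$ | $\epsilon$=0.3, $> \epsilon$ | $\epsilon$=0.3, $< -\epsilon$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 89  | 0.87 | 0.025 | 18 | 13 | 14 | 7  | 34 | 29 | 24 | 22 |
| 2 | 74  | 0.77 | 0.025 | 25 | 23 | 14 | 7  | 32 | 32 | 28 | 23 |
| 3 | 138 | 0.76 | 0.172 | 11 | 9  | 8  | 3  | 18 | 27 | 20 | 15 |
| 4 | 104 | 0.76 | 0.172 | 21 | 12 | 9  | 6  | 35 | 28 | 22 | 22 |
| 5 | 104 | 0.75 | 0.172 | 23 | 14 | 11 | 6  | 35 | 25 | 21 | 28 |
| 6 | 77  | 0.74 | 0.025 | 26 | 11 | 18 | 11 | 26 | 39 | 27 | 26 |
| 7 | 102 | 0.72 | 0.025 | 18 | 13 | 8  | 8  | 32 | 31 | 29 | 21 |

Table 5: OCC-500K: the Number of Cases for $\frac{F' - f}{f} > \epsilon$ and $\frac{F' - f}{f} < -\epsilon$ (columns 5-8: $\lambda = 0.1$; columns 9-12: $\lambda = 0.05$)

| Micro Group $g$ | $\lvert g \rvert$ | $f$ in $g$ | $f$ in $D$ | $\epsilon$=0.2, $> \epsilon$ | $\epsilon$=0.2, $< -\epsilon$ | $\epsilon$=0.3, $> \epsilon$ | $\epsilon$=0.3, $< -\epsilon$ | $\epsilon$=0.2, $> \epsilon$ | $\epsilon$=0.2, $< -\epsilon$ | $\epsilon$=0.3, $> \epsilon$ | $\epsilon$=0.3, $< -\epsilon$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 142 | 0.48 | 0.038 | 13 | 18 | 11 | 4  | 29 | 29 | 27 | 20 |
| 2 | 213 | 0.47 | 0.024 | 8  | 8  | 3  |    | 23 | 19 | 15 | 11 |
| 3 | 111 | 0.47 | 0.026 | 26 | 18 | 16 | 13 | 40 | 30 | 27 | 29 |
| 4 | 113 | 0.45 | 0.024 | 28 | 20 | 17 | 8  | 38 | 30 | 23 | 25 |
| 5 | 153 | 0.45 | 0.026 | 18 | 15 | 6  | 6  | 20 | 40 | 26 | 21 |
| 6 | 237 | 0.45 | 0.024 | 8  | 9  | 3  | 3  | 17 | 21 | 13 | 12 |
| 7 | 143 | 0.44 | 0.038 | 12 | 17 | 12 | 6  | 38 | 34 | 27 | 20 |

7. CONCLUSION

Reconstruction of data distribution is traditionally regarded as utility. In this work, we showed that reconstruction could lead to privacy breaches even if major privacy definitions are satisfied. We formalized a privacy definition to address this risk and presented an enforcement solution. A novelty of this work lies in the distinction between reconstruction that has privacy risk and reconstruction that does not. We leveraged this distinction to meet the dual requirement of privacy and utility. Another novelty is the independence from the particular form of the bounds on tail probabilities. Our privacy definition is a constraint on the upper bounds of tail probabilities and the Chernoff bound in particular. This formulation allows us to leverage the upper bound literature to develop a concrete solution to the problem identified. However, our approach can be instantiated to other upper bounds and modified to constrain the lower bounds of tail probabilities, thanks to the general form of the bound conversion theorem (Theorem 2).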